The task is to predict future stock prices and calculate future returns. I will work under the assumption that stocks X1 to X56 are uncorrelated, i.e. the price of X1 does not affect or correlate with the prices of X2-X56.
In this workbook, I will work through the process of data exploration and analysis, and build a processing pipeline to achieve the above objective.
I will first start with exploring one stock. I will load the data and convert it to time-series.
Dataset: dailyPrices_AtClose.csv. This dataset consists of Time (t) and stocks (X1-X56), as shown below:
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings('ignore')
# plot utilities
%matplotlib inline
import matplotlib.pyplot as plt
import mpld3
mpld3.enable_notebook()
# load dailyPrices
tmp = pd.read_csv('dailyPrices_AtClose.csv')
tmp.head(5)
# convert data to time-series data while reading from csv itself
dailyPrices = pd.read_csv('dailyPrices_AtClose.csv', parse_dates=['t()'], index_col='t()')
#dailyPrices.index # check if the index is datetime
# re-index and set frequency='D'
dates = pd.date_range(dailyPrices.index[0], dailyPrices.index[-1], freq='D')
dailyPrices = dailyPrices.reindex(dates)
dailyPrices.index
# display data
dailyPrices.head(5)
dailyPrices.describe()
A few stocks have a lot of missing data. I will choose "X3" as the stock for my exploratory analysis, since it appears to have the most complete data.
# display one stock data
dailyPrices.X3.plot()
plt.show()
We have now converted the dailyPrices data to a time-series. Looking at the x-axis, the observations appear to be in chronological order with the latest last, so there is no need to sort the data by time.
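This ordering can also be checked programmatically rather than by eye; a minimal sketch on a toy frame (the same one-line check applies to dailyPrices.index):

```python
import pandas as pd

# toy frame standing in for dailyPrices; only the index check matters here
idx = pd.date_range('2020-01-01', periods=5, freq='D')
toy = pd.DataFrame({'X3': [10.0, 10.5, 10.2, 10.8, 11.0]}, index=idx)

# True when timestamps run oldest to newest, i.e. no sorting is needed
print(toy.index.is_monotonic_increasing)
```

Applied to the real data, `dailyPrices.index.is_monotonic_increasing` returning True confirms no sort is required.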
Time series data needs to be Stationary[1] in order to perform statistical analysis or modelling. Let us test if this time-series data is stationary.
I will pick one stock to perform this test, i.e. X3. First, I will plot a few moving average curve to visually check if the data has constant mean and constant variance.
# create a data frame for X3
dframe = pd.DataFrame(dailyPrices.index)
dframe['P(t)'] = dailyPrices.X3.values
df = dframe['P(t)'][~dframe['P(t)'].isnull()]
# moving averages
dframe['MA_30'] = df.rolling(window=30).mean()    # 30-day MA
dframe['MA_60'] = df.rolling(window=60).mean()    # 60-day MA
dframe['MA_120'] = df.rolling(window=120).mean()  # 120-day MA
# standard deviations
dframe['SD_30'] = df.rolling(window=30).std()     # 30-day SD
dframe['SD_60'] = df.rolling(window=60).std()     # 60-day SD
dframe['SD_120'] = df.rolling(window=120).std()   # 120-day SD
# plot
dframe.loc[:, ['MA_30', 'MA_60', 'MA_120']].plot(title='Moving Averages - X3')
dframe.loc[:, ['SD_30', 'SD_60', 'SD_120']].plot(title='Standard Deviations - X3')
Looking at the plots of moving averages and standard deviations, it is apparent that this time-series is non-stationary in mean and variance.
Let us do a Unit-Root test[2] to confirm that the time-series is non-stationary. I will use Dickey-Fuller[3] test below:
# Dickey-Fuller Test
from statsmodels.tsa.stattools import adfuller
tseries = dailyPrices.X3
# discard missing data
tseries = tseries.dropna()
df = adfuller(tseries)
print(df)
Test-Statistic = -1.93 and the Critical Values are -3.43 (1%), -2.86 (5%) and -2.57 (10%). The p-value is well above 0.05. Since the Test-Statistic is not less than any of the Critical Values, this time-series has a unit root and is non-stationary.
I will now wrap this into a function.
def testStationarity(ts=None, displaySummary=True, displayPlot=True):
    ''' Test whether the input series is stationary or not
    '''
    # remove NaNs
    ts = ts.dropna()
    # create a data frame for the series
    dframe = pd.DataFrame(ts.index)
    dframe['P(t)'] = ts.values
    d = dframe['P(t)'][~dframe['P(t)'].isnull()]
    # Dickey-Fuller test
    df = adfuller(ts)
    if df[0] < df[4]['1%']:
        confi = 0.99
        isStationary = True
        strStationary = ' DFTest: Stationary' + ' (confidence= %.2f)' % confi
    elif df[0] < df[4]['5%']:
        confi = 0.95
        isStationary = True
        strStationary = ' DFTest: Stationary' + ' (confidence= %.2f)' % confi
    elif df[0] < df[4]['10%']:
        confi = 0.90
        isStationary = True
        strStationary = ' DFTest: Stationary' + ' (confidence= %.2f)' % confi
    else:
        confi = 0
        isStationary = False
        strStationary = ' DFTest: Non Stationary'
    # moving averages
    dframe['MA_30'] = d.rolling(window=30).mean()    # 30-day MA
    dframe['MA_60'] = d.rolling(window=60).mean()    # 60-day MA
    dframe['MA_120'] = d.rolling(window=120).mean()  # 120-day MA
    # standard deviations
    dframe['SD_30'] = d.rolling(window=30).std()     # 30-day SD
    dframe['SD_60'] = d.rolling(window=60).std()     # 60-day SD
    dframe['SD_120'] = d.rolling(window=120).std()   # 120-day SD
    if displayPlot:
        dframe.loc[:, ['MA_30', 'MA_60', 'MA_120']].plot(title='Moving Averages - ' + ts.name + strStationary)
        dframe.loc[:, ['SD_30', 'SD_60', 'SD_120']].plot(title='Standard Deviations - ' + ts.name + strStationary)
    if displaySummary:
        print(df)
        print('Time series - ' + ts.name + ',' + strStationary)
    return isStationary
test = testStationarity(dailyPrices.X3)
print(test)
I will try to determine if the price is correlated to prices at other timestamps via autocorrelation and partial autocorrelation.
# auto correlation
from statsmodels.tsa.stattools import acf
from statsmodels.tsa.stattools import pacf
acf_coeffs = acf(dailyPrices.X3.dropna())
# plot
plt.plot(acf_coeffs, marker='o',linestyle='-')
plt.title('Autocorrelation')
plt.show()
pcf_coeffs = pacf(dailyPrices.X3.dropna())
# plot
plt.plot(pcf_coeffs, marker='o',linestyle='-')
plt.title('Partial autocorrelation')
plt.show()
The autocorrelation plot suggests that the price is heavily correlated with roughly the previous 40 days. Converting the non-stationary series to a stationary one should remove much of this dependence.
The partial-autocorrelation reveals that most of the correlation comes from lag-1. This information will help in choosing modelling parameters such as the auto-regressive (AR) order.
We have now ascertained that the timeseries is non-stationary. I will now attempt to convert non-stationary time-series to stationary time-series.
First step would be to transform the timeseries data. Stock prices are considered to be lognormal distributions [4].
One straightforward way of converting a non-stationary time-series to stationary is simple differencing, which can be applied to raw, log, square-root or cube-root transformed data. I will perform simple differencing on the raw, log and square-root transformed data for the sake of completeness.
# Let me plot simple-difference data
tseries = dailyPrices.X3
tseries = tseries.dropna()
def simpleDiff(ts=None):
    ''' Simple differencing using shift=+1
    '''
    return ts - ts.shift(1)
# simple-differencing on raw
simd_raw = simpleDiff(tseries)
# simple differencing on log
simd_log = simpleDiff(np.log(tseries))
# simple differencing on square-root
simd_sq = simpleDiff(tseries**(0.5))
# plot
f, axarr = plt.subplots(3, sharex=True)
axarr[0].plot(simd_raw)
axarr[0].set_title('Raw')
axarr[1].plot(simd_log)
axarr[1].set_title('Log')
axarr[2].plot(simd_sq)
axarr[2].set_title('Square-root')
From the three plots above, simple differencing on raw, log or square-root data does not appear to produce increasing variance over time. We may have converted the non-stationary series to a stationary one via these simple methods. However, let us verify this using the testStationarity() function.
# test raw
print("==== Testing simpleDiff on raw data for Stationarity ======")
testStationarity(simd_raw, displayPlot=False)
print("<<<<<<<<<<<<<<<<<<<<<<<<<<<<>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>")
print("==== Testing simpleDiff on log data for Stationarity ======")
testStationarity(simd_log, displayPlot=False)
print("<<<<<<<<<<<<<<<<<<<<<<<<<<<<>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>")
print("==== Testing simpleDiff on square-root for Stationarity ======")
testStationarity(simd_sq, displayPlot=False)
print("<<<<<<<<<<<<<<<<<<<<<<<<<<<<>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>")
All three conversion methods yield a stationary time-series with confidence = 0.99, so none of them has a unit root. For the raw and square-root data, the Test-Statistic is far below the 1% Critical Value and the p-value is 0, compared to the log-transformed series. Based purely on the Dickey-Fuller test results, simple differencing on square-root data would be the best pick. However, the log-scale has its own advantages [4], and the log-transformed data produces a good enough stationary series, so I will pick the log-based approach.
I will now try 3-day returns on the log-transformed data, to check whether that yields a better stationary time-series. In log-scale, the difference between log P(t+3) and log P(t) directly yields the 3-day log return.
# 3 day returns
tseries = dailyPrices.X3
day3Returns = np.log(tseries) - (np.log(tseries)).shift(3)
print("==== Testing 3Day-returns on log data for Stationarity ======")
testStationarity(day3Returns, displayPlot=False)
print("<<<<<<<<<<<<<<<<<<<<<<<<<<<<>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>")
The task is to forecast returns at P(t+3), i.e. 3-day returns. Using 3-day returns as the stationary series would make the final output easier to interpret without further data manipulation. However, comparing Test-Statistics against Critical Values and p-values, the 1-day returns (simple log differences) appear more desirable than the 3-day returns.
Maybe I should try removing the trend from the log-transformed data; this could improve on the simple methods. I will use a 30-day window for moving-average smoothing.
# log transform
tseries = dailyPrices.X3
tseries_log = np.log(tseries.dropna())
ma_30 = tseries_log.rolling(window=30, center=True).mean()
# plot
plt.plot(tseries_log)
plt.plot(ma_30,'r')
# remove ma_30 from tseries_log
tseries_log_ma30diff = tseries_log - ma_30
# test for stationarity
test = testStationarity(tseries_log_ma30diff)
Although this creates a stationary time-series with confidence = 0.99, the gap between the Test-Statistic and the 1% Critical Value is smaller than with the simple methods. Perhaps removing an exponentially-weighted moving average instead of a plain moving average will give a better result.
# remove exponential moving average from tseries_log
tseries_log_ema30diff = tseries_log - tseries_log.ewm(halflife=30).mean()
# test for stationarity
test = testStationarity(tseries_log_ema30diff, displayPlot=False)
An exponentially-weighted moving average gives a stationary time-series with confidence = 0.95. However, the gap between the Test-Statistic and the 1% Critical Value is again smaller than with the simpler methods.
I will now try something different. I will attempt to smooth the log-transformed data and then convert to daily-returns to see if that can provide any other insight.
#log transform
tseries = dailyPrices.X3
tseries_log = np.log(tseries.dropna())
#smoothed data
ma_30 = tseries_log.rolling(window=30, center=True).mean()
# day-returns
day_returns = ma_30 - ma_30.shift(1)
# 3-day returns
day3_returns = ma_30 - ma_30.shift(3)
# test
print("==== Testing Day-returns on smoothed log data for Stationarity ======")
testStationarity(day_returns, displayPlot=False)
print("<<<<<<<<<<<<<<<<<<<<<<<<<<<<>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>")
print("==== Testing 3Day-returns on smoothed log data for Stationarity ======")
testStationarity(day3_returns, displayPlot=False)
print("<<<<<<<<<<<<<<<<<<<<<<<<<<<<>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>")
The Dickey-Fuller test statistics are no better than those for simple differencing of the log-transformed data.
Simple differencing yields a stationary time-series with confidence = 0.99, and its Test-Statistic is much lower than the 1% Critical Value compared to the other conversions. I will therefore use simple differencing of the log-transformed series to convert the non-stationary time-series to a stationary one.
I will now wrap this conversion into a function.
def convert2Stationary(ts=None, disp=True):
    ''' Convert a non-stationary time-series to stationary using simple
    differencing of the log-transformed time-series.
    The input time-series is first checked for stationarity, then the
    conversion is done.
    input: ts -> time-series of prices
    output: stationary time-series
    '''
    # log transform
    ts_log = np.log(ts.dropna())
    out = ts_log
    if not testStationarity(ts, displayPlot=False, displaySummary=False):
        # differencing
        out = ts_log - ts_log.shift(1)
    # perform test
    if disp:
        print("=== after conversion test for Stationarity ===")
        testStationarity(out, displayPlot=disp, displaySummary=disp)
    return out
ts_new = convert2Stationary(dailyPrices.X3)
Let me perform a quick test to find out whether converting the non-stationary time-series to a stationary one has made the values independent. For this, I will use autocorrelation.
# acf and pacf
statTS = convert2Stationary(dailyPrices.X3, False)
statTS = statTS.dropna()
acf_coeffs = acf(statTS)
# plot
plt.plot(acf_coeffs, marker='o',linestyle='-')
plt.title('Autocorrelation')
plt.show()
# partial autocorrelation
pcf_coeffs = pacf(statTS)
# plot
plt.plot(pcf_coeffs, marker='o', linestyle='-')
plt.title('Partial autocorrelation')
plt.show()
The autocorrelation plot reveals no significant correlation with values at other timestamps; the differenced values are therefore approximately independent.
I will now try to decompose the timeseries for seasonality i.e. trend, seasonal and residuals.
# seasonal decompose
from statsmodels.tsa.seasonal import seasonal_decompose
#
day_returns = convert2Stationary(dailyPrices.X3, False)
# index-0 of day_returns will be NaN due to differencing.
# (Filling the NaNs with the series median is one option; a more sophisticated
# filling technique such as Kalman filtering might prove advantageous.)
decomp = seasonal_decompose(day_returns.dropna().values, period=30)
decomp.plot()
plt.show()
Decomposition produced trend, seasonal and residual series, all of which appear stationary. The residual looks much the same as the observed data. I am unsure whether this information is useful. Do stocks have seasonality? It is difficult to say.
Steps completed:
I will now try to perform regression to generate a model, this model should help me in forecasting the future.
The most commonly used regression for time-series forecasting is the Auto-Regressive Moving Average (ARMA)[5] model. I will use the ARIMA implementation from statsmodels to predict future prices.
The stationary time-series we generated shows little correlation across timestamps. Hence I set the AR order p=1, d=0 and the MA order q=1.
# ARIMA
import statsmodels.tsa as tsa
import statsmodels.tsa.arima_model  # ensure the submodule is loaded
#
day_returns = convert2Stationary(dailyPrices.X3, False)
# index-0 of day_returns will be NAN due to differencing.
model = tsa.arima_model.ARIMA(day_returns.iloc[1:], order=(1, 0, 1)).fit(disp=-1)
print(model.summary())
#
#dates = pd.date_range(dailyPrices.index[0], dailyPrices.index[-1], freq='D')
#res_s = pd.Series(model.fittedvalues, index=dates)
plt.plot(day_returns.iloc[1:])
plt.plot(model.fittedvalues, 'r')
plt.show()
I have now built a model using ARIMA, but the fitted values are not impressive at all. The log-likelihood is high, suggesting the estimates arise from the data, yet the fitted values appear to underestimate/underfit the data by a large margin.
Let me do some in-sample training and out-of-sample testing.
# in-sample training
day_returns = convert2Stationary(dailyPrices.X3, False)
# index-0 of day_returns will be NAN due to differencing.
model = tsa.arima_model.ARIMA(day_returns.iloc[1:2000], order=(1, 0, 1)).fit(disp=-1)
print(model.summary())
#
plt.plot(day_returns.iloc[1:])
plt.plot(model.fittedvalues, 'r')
plt.show()
Next, I will perform out-of-sample testing on a very small number of samples, and display some key insights into the predicted outcomes.
To calculate 3-day returns, I will first compute a 4-day rolling sum of daily returns and then subtract the current day's return. All returns are in log-scale.
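This rolling-sum construction can be sanity-checked on a toy series: for daily log-returns r, the forward 3-day log return from day t is r(t+1) + r(t+2) + r(t+3), which equals the trailing 4-day rolling sum shifted back by 3, minus the current day's return. A minimal check:

```python
import numpy as np
import pandas as pd

# toy daily log-returns
r = pd.Series([0.01, -0.02, 0.03, 0.005, -0.01, 0.02])

# 4-day window ending at t+3, aligned back to t, minus the current day's return
rollsum = r.rolling(window=4).sum()
fwd3 = rollsum.shift(-3) - r

# direct computation of r(t+1) + r(t+2) + r(t+3) for comparison
direct = r.shift(-1) + r.shift(-2) + r.shift(-3)
print(np.allclose(fwd3.dropna(), direct.dropna()))  # True
```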
# out of sample testing
preds = model.predict(1998, 2009)
orig = day_returns[day_returns.index[1998]: day_returns.index[2009]]
# Original-returns: rolling sum of log-returns over a window of 4 days
# (including the current day), shifted back by 3, minus the current day's return
rollsum = orig.rolling(window=4).sum()
orig_day3_returns = rollsum.shift(-3) - orig
# Predicted-returns:
rollsum = preds.rolling(window=4).sum()
preds_day3_returns = rollsum.shift(-3) - preds
tmp = pd.DataFrame(orig.index)
tmp['Orig Log Returns'] = orig.values
tmp['Pred Log Returns'] = preds.values
tmp['AbsError'] = np.abs(orig.values-preds.values)
tmp['AbsPercentError(%)'] = np.abs((orig.values-preds.values)/orig.values) * 100
tmp['DirectionAccuracy'] = ~((orig.values > 0) != (preds.values > 0))
tmp['Orig 3-day Log returns'] = orig_day3_returns.values
tmp['Pred 3-day Log returns'] = preds_day3_returns.values
tmp['3-day AbsError'] = np.abs(orig_day3_returns.values - preds_day3_returns.values)
tmp['3-day AbsPercentError(%)'] = np.abs((orig_day3_returns.values - preds_day3_returns.values)/orig_day3_returns.values) * 100
tmp['3-day DirectionAccuracy'] = ~((orig_day3_returns.values > 0) != (preds_day3_returns.values > 0))
tmp
I have performed in-sample training using the first 1999 ticks of the time-series, followed by out-of-sample testing on a small subset (12 ticks). The Absolute Percent Error for daily returns is high; however, the direction (up/down) was predicted correctly in 5 out of 12 instances for daily log-returns. This is a very rudimentary test and highly unreliable; a much more rigorous training and testing regime is needed to build a reliable model.
Looking at 3-day returns, the Absolute Percent Error is again high; however, the direction is correct in 5 out of 9 instances.
Observations:
I will now try to forecast one-datapoint into the future:
# original timeseries
orig_tseries = dailyPrices.X3
# create a new time series by pushing it 3-days ahead and fill it with 0
pushed_tseries = pd.Series(0, orig_tseries.index + pd.Timedelta('3 day'))
# extract the last 3-days from pushed_tseries and append it to original timeseries
tseries = pd.concat([orig_tseries, pushed_tseries[-3:]], verify_integrity=True)
# convert original timeseries X3 to stationary
day_returns = convert2Stationary(dailyPrices.X3, False)
# index-0 of day_returns will be NAN due to differencing.
# usually day_returns[original_length+1] will have -INF due to 0 insertion.
# modelling (use all data starting from index-1 day of day_returns to the lastday of the orignal timeseries)
model = tsa.arima_model.ARIMA(day_returns.iloc[1:], order=(1, 0, 1)).fit(disp=-1)
print(model.summary())
# index of day_returns runs from 0 to day_returns.shape[0]-1, for a total of day_returns.shape[0] elements.
# We ignore index-0 while training (it may contain NaN due to differencing), hence we train
# with day_returns.shape[0]-1 elements (indices 1 to day_returns.shape[0]-1).
# Therefore the trained model will have a length of (day_returns.shape[0]-1)-1.
# For prediction, the starting index should be at least that of the last fitted value.
# predict
preds = model.predict(model.fittedvalues.shape[0]-1, model.fittedvalues.shape[0]+2 )
### calculate error metrics and accuracy of fittedvalues
# Original-returns: calculate rolling_sum of log-returns over a window of 4 days, including current day
orig_returns = day_returns[1:]
rollsum = orig_returns.rolling(window=4).sum()
orig_3day_returns = rollsum.shift(-3) - orig_returns
# fitted-returns:
rollsum = model.fittedvalues.rolling(window=4).sum()
fitted_3day_returns = rollsum.shift(-3) - model.fittedvalues
# display dataframe
statsFrame = pd.DataFrame(orig_returns.index)
statsFrame['Orig Log Returns'] = orig_returns.values
statsFrame['Pred Log Returns'] = model.fittedvalues.values
statsFrame['AbsError'] = np.abs(orig_returns.values - model.fittedvalues.values)
statsFrame['AbsPercentError(%)'] = np.abs((orig_returns.values - model.fittedvalues.values)/(orig_returns.values+0.00000001)) * 100
statsFrame['DirectionAccuracy'] = ~((orig_returns.values > 0) != (model.fittedvalues.values > 0))
statsFrame['Orig 3-day Log returns'] = orig_3day_returns.values
statsFrame['Pred 3-day Log returns'] = fitted_3day_returns.values
statsFrame['3-day AbsError'] = np.abs(orig_3day_returns.values - fitted_3day_returns.values)
statsFrame['3-day AbsPercentError(%)'] = np.abs((orig_3day_returns.values - fitted_3day_returns.values)/(orig_3day_returns.values+0.00000001)) * 100
statsFrame['3-day DirectionAccuracy'] = ~((orig_3day_returns.values > 0) != (fitted_3day_returns.values > 0))
# create a fitted_day_returns series using the index from the extended
# timeseries (built via pushed_tseries) and the fitted values
fitted_day_returns = pd.Series(model.fittedvalues, index=tseries.index)
# copy the forecast values to correct dates
fitted_day_returns[-4 : fitted_day_returns.shape[0]] = preds
forecast_day_returns = fitted_day_returns[-4 : fitted_day_returns.shape[0]]
# Predicted-returns:
rollsum = forecast_day_returns.rolling(window=4).sum()
forecast_3day_returns = rollsum.shift(-3) - forecast_day_returns
# Display
print("===== 3-day Returns forecast for closing price of last-day in timeseries ======")
print(" Note: The first row in the table corresponds to the last day \n closing price of the stock")
print(" Stock: "+ dailyPrices.X3.name)
futureFrame = pd.DataFrame(forecast_day_returns.index)
futureFrame['Forecast Log Returns'] = forecast_day_returns.values
futureFrame['Forecast 3-day Log Returns'] = forecast_3day_returns.values
futureFrame['Long term MAE (log returns)'] = np.mean(statsFrame['AbsError'])
futureFrame['Long term MAPE (log returns) %'] = np.mean(statsFrame['AbsPercentError(%)']/100)*100
futureFrame['Long term DirectionAccuracy (log returns) %'] = (sum(statsFrame['DirectionAccuracy'])/statsFrame.shape[0]) * 100
futureFrame['Long term MAE (3-day log returns)'] = np.mean(statsFrame['3-day AbsError'])
futureFrame['Long term MAPE (3-day log returns) %'] = np.mean(statsFrame['3-day AbsPercentError(%)']/100) * 100
futureFrame['Long term DirectionAccuracy (3-day log returns) %'] = (sum(statsFrame['3-day DirectionAccuracy'])/statsFrame.shape[0]) * 100
print('>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>><<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<')
futureFrame
The first row corresponds to the last-day closing price of the stock available in the dataset; this may vary from stock to stock. I have included forecasts for daily log-returns and 3-day log-returns, along with metrics from fitting the model to prior data; these metrics can be used to gauge confidence in the forecast. Direction-Accuracy refers to whether the direction of movement of the stock was correctly predicted on the prior data. From the table it is clear that the model is not very good: the direction accuracy of daily returns is only slightly better than random (a coin toss), whereas the direction accuracy for 3-day returns is just 35%.
I think I need to look closer at ARIMA modelling or, better, at other ways of modelling/forecasting univariate data. I have neither spent time optimising the ARIMA parameters in this task, nor have I looked at all stocks to generalise the exploration-analysis-modelling-prediction pipeline.
I have now completed exploration, analysis, modelling and prediction of a time-series (specifically stock: X3) for forecasting 3-day returns.
I will now wrap this into a function and produce 3-day returns for all stocks in the dataset.
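A sketch of that wrapper follows. The forecast step here is a deliberately naive stand-in (the sum of the last three observed daily log-returns); a real version would call convert2Stationary() and fit an ARIMA per column, exactly as done for X3 above:

```python
import numpy as np
import pandas as pd

def forecast_3day_returns_all(prices):
    ''' Apply the per-stock pipeline to every column of a price frame.
    Returns a Series of 3-day log-return estimates, one per stock.
    '''
    results = {}
    for col in prices.columns:
        ts = prices[col].dropna()
        if ts.shape[0] < 10:  # skip stocks with too little data
            results[col] = np.nan
            continue
        day_returns = np.log(ts).diff().dropna()
        # naive placeholder: last observed 3-day log return;
        # a real pipeline would fit a model per stock here instead
        results[col] = day_returns.iloc[-3:].sum()
    return pd.Series(results)

# toy two-stock frame standing in for dailyPrices
idx = pd.date_range('2020-01-01', periods=30, freq='D')
rng = np.random.default_rng(4)
toy = pd.DataFrame({
    'X1': 100 * np.exp(np.cumsum(0.01 * rng.standard_normal(30))),
    'X2': 50 * np.exp(np.cumsum(0.01 * rng.standard_normal(30))),
}, index=idx)
out = forecast_3day_returns_all(toy)
print(out)
```

Since the stocks are assumed uncorrelated, each column can be processed independently in this loop.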